The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this: it is a historic moment, scientifically and analytically, with a large amount of data being shared across the globe and analyzed in real time. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World In Data. I also noticed that JHU's website was buggy: you clicked on the U.S. stats but they didn't reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Fri Jun 5 17:45:52 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Fri Jun  5 17:45:52 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name | description
prediction_model | outputs case estimates for given log-linear model parameters (slope and intercept)
make_long | converts input data to long format (specialized cases)
name_overlaps | outputs the column-name intersection and set differences of two data frames
find_linear_index | finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b          # the model is fit on log10 cases
  total_cases<-10^total_cases.log    # back-transform to case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
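The log-linear model described above maps fit parameters back to case counts. A quick sanity sketch with made-up parameters (m and b here are illustrative values, not fitted):

```r
# Hypothetical log-linear fit: log10(cases) = m * days + b
m <- 0.05          # assumed slope
b <- 2             # assumed intercept: 10^2 = 100 cases at day 0
days <- c(0, 7, 14)
round(10^(m * days + b))  # 100 224 501
```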

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization (assumes the data structure's naming) by using a fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
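make_long is a thin wrapper around reshape2::melt. On a toy wide frame (toy column names, not the real schema) the underlying call behaves like:

```r
library(reshape2)
# One id column, two date columns in JHU-style wide format
wide <- data.frame(Country.Region = "X", X3.1.20 = 5, X3.2.20 = 9)
long <- melt(wide, id.vars = "Country.Region",
             variable.name = "Date", value.name = "Total_confirmed_cases")
nrow(long)  # 2: one row per date column
```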

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[(t-running_avg+1):t,"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  #Find the date `running_avg` days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
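The idea behind find_linear_index can be illustrated on a synthetic log2 series whose growth rate settles to a constant (all values here are made up):

```r
# Synthetic day-over-day log2 growth rates: erratic early, then steady
rate <- c(1, 0.7, 0.5, 0.3, rep(0.2, 10))
log2_cases  <- log2(100) + cumsum(rate)
first_diff  <- diff(log2_cases)   # growth rate       (m.time above)
second_diff <- diff(first_diff)   # change in growth  (mm.time above)
# A 5-day window where every |second derivative| < 0.2 marks stable growth
window <- 5
stable <- vapply(window:length(second_diff), function(t)
  all(abs(second_diff[(t - window + 1):t]) < 0.2), logical(1))
any(stable)  # TRUE: the tail of the series is linear in log space
```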

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.4.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.4.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.4.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.4.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other data... Weird. When combining, all datasets will have this column, but US deaths is the only one where it is meaningful.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU uses a gross file format: wide, with each column being a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # read.csv prefixes numeric column names with X; swap in a leading 0 (months are single-digit through June)
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # year ends in .20, not .2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
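The date cleanup can be sketched on a couple of sample column names (assuming the JHU wide format):

```r
# read.csv mangles date headers like "1/22/20" into "X1.22.20"
d <- c("X1.22.20", "X6.4.20")
d <- gsub("^X", "0", d)               # "01.22.20" (months are single-digit through June 2020)
d <- as.Date(d, format = "%m.%d.%y")
d  # "2020-01-22" "2020-06-04"
```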

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 135 135
Albania 135 135
Algeria 135 135
Andorra 135 135
Angola 135 135
Antigua and Barbuda 135 135
Argentina 135 135
Armenia 135 135
Australia 1080 1080
Austria 135 135
Azerbaijan 135 135
Bahamas 135 135
Bahrain 135 135
Bangladesh 135 135
Barbados 135 135
Belarus 135 135
Belgium 135 135
Belize 135 135
Benin 135 135
Bhutan 135 135
Bolivia 135 135
Bosnia and Herzegovina 135 135
Botswana 135 135
Brazil 135 135
Brunei 135 135
Bulgaria 135 135
Burkina Faso 135 135
Burma 135 135
Burundi 135 135
Cabo Verde 135 135
Cambodia 135 135
Cameroon 135 135
Canada 1890 1890
Central African Republic 135 135
Chad 135 135
Chile 135 135
China 4455 4455
Colombia 135 135
Comoros 135 135
Congo (Brazzaville) 135 135
Congo (Kinshasa) 135 135
Costa Rica 135 135
Cote d’Ivoire 135 135
Croatia 135 135
Cuba 135 135
Cyprus 135 135
Czechia 135 135
Denmark 405 405
Diamond Princess 135 135
Djibouti 135 135
Dominica 135 135
Dominican Republic 135 135
Ecuador 135 135
Egypt 135 135
El Salvador 135 135
Equatorial Guinea 135 135
Eritrea 135 135
Estonia 135 135
Eswatini 135 135
Ethiopia 135 135
Fiji 135 135
Finland 135 135
France 1485 1485
Gabon 135 135
Gambia 135 135
Georgia 135 135
Germany 135 135
Ghana 135 135
Greece 135 135
Grenada 135 135
Guatemala 135 135
Guinea 135 135
Guinea-Bissau 135 135
Guyana 135 135
Haiti 135 135
Holy See 135 135
Honduras 135 135
Hungary 135 135
Iceland 135 135
India 135 135
Indonesia 135 135
Iran 135 135
Iraq 135 135
Ireland 135 135
Israel 135 135
Italy 135 135
Jamaica 135 135
Japan 135 135
Jordan 135 135
Kazakhstan 135 135
Kenya 135 135
Korea, South 135 135
Kosovo 135 135
Kuwait 135 135
Kyrgyzstan 135 135
Laos 135 135
Latvia 135 135
Lebanon 135 135
Lesotho 135 135
Liberia 135 135
Libya 135 135
Liechtenstein 135 135
Lithuania 135 135
Luxembourg 135 135
Madagascar 135 135
Malawi 135 135
Malaysia 135 135
Maldives 135 135
Mali 135 135
Malta 135 135
Mauritania 135 135
Mauritius 135 135
Mexico 135 135
Moldova 135 135
Monaco 135 135
Mongolia 135 135
Montenegro 135 135
Morocco 135 135
Mozambique 135 135
MS Zaandam 135 135
Namibia 135 135
Nepal 135 135
Netherlands 675 675
New Zealand 135 135
Nicaragua 135 135
Niger 135 135
Nigeria 135 135
North Macedonia 135 135
Norway 135 135
Oman 135 135
Pakistan 135 135
Panama 135 135
Papua New Guinea 135 135
Paraguay 135 135
Peru 135 135
Philippines 135 135
Poland 135 135
Portugal 135 135
Qatar 135 135
Romania 135 135
Russia 135 135
Rwanda 135 135
Saint Kitts and Nevis 135 135
Saint Lucia 135 135
Saint Vincent and the Grenadines 135 135
San Marino 135 135
Sao Tome and Principe 135 135
Saudi Arabia 135 135
Senegal 135 135
Serbia 135 135
Seychelles 135 135
Sierra Leone 135 135
Singapore 135 135
Slovakia 135 135
Slovenia 135 135
Somalia 135 135
South Africa 135 135
South Sudan 135 135
Spain 135 135
Sri Lanka 135 135
Sudan 135 135
Suriname 135 135
Sweden 135 135
Switzerland 135 135
Syria 135 135
Taiwan* 135 135
Tajikistan 135 135
Tanzania 135 135
Thailand 135 135
Timor-Leste 135 135
Togo 135 135
Trinidad and Tobago 135 135
Tunisia 135 135
Turkey 135 135
Uganda 135 135
Ukraine 135 135
United Arab Emirates 135 135
United Kingdom 1485 1485
Uruguay 135 135
US 135 135
US_state 440235 440235
Uzbekistan 135 135
Venezuela 135 135
Vietnam 135 135
West Bank and Gaza 135 135
Western Sahara 135 135
Yemen 135 135
Zambia 135 135
Zimbabwe 135 135
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized at the country level (they have province rows); collapse all but the US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total
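The spread-then-divide step, sketched on a toy long frame (toy names, made-up counts):

```r
library(tidyr)
long <- data.frame(region = c("A", "A"), Date = c(1, 1),
                   case_type = c("total", "death"), cases = c(200, 10))
# case_type values become side-by-side columns "total" and "death"
wide <- spread(long, key = case_type, value = cases)
wide$mortality_rate <- wide$death / wide$total
wide$mortality_rate  # 0.05
```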

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case occurred for each Country/Province. NOTE: Non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
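The merge-and-subtract logic, sketched in base R (aggregate instead of ddply, made-up numbers):

```r
df <- data.frame(region = rep(c("A", "B"), each = 3),
                 Date.numeric = rep(1:3, 2),
                 cases = c(50, 120, 300, 90, 95, 150))
# First numeric date each region exceeded 100 cases
case100 <- aggregate(Date.numeric ~ region, data = df[df$cases > 100, ], FUN = min)
names(case100)[2] <- "case100_date"
df <- merge(df, case100)
df$Days_since_100 <- df$Date.numeric - df$case100_date
range(df$Days_since_100)  # -2 1
```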

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 4873
Alaska 935
Arizona 1211
Arkansas 5130
California 4617
Colorado 4427
Connecticut 718
Delaware 299
Diamond Princess 80
District of Columbia 81
Florida 5178
Georgia 11620
Grand Princess 81
Guam 81
Hawaii 417
Idaho 2339
Illinois 6733
Indiana 6745
Iowa 6280
Kansas 5243
Kentucky 7546
Louisiana 4865
Maine 1237
Maryland 1889
Massachusetts 1211
Michigan 5914
Minnesota 5616
Mississippi 6034
Missouri 6757
Montana 2093
Nebraska 3975
Nevada 936
New Hampshire 851
New Jersey 1817
New Mexico 2068
New York 4613
North Carolina 7136
North Dakota 2473
Northern Mariana Islands 66
Ohio 6357
Oklahoma 4860
Oregon 2447
Pennsylvania 5003
Puerto Rico 81
Rhode Island 478
South Carolina 3513
South Dakota 3137
Tennessee 6883
Texas 14625
Utah 1110
Vermont 1140
Virgin Islands 81
Virginia 9022
Washington 3151
West Virginia 3331
Wisconsin 4850
Wyoming 1504
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question | dataset | x | y | color | facet | pch | dimensions
comparative longitudinal case trend | long | time | log_cases | geography | none (case type?) | case_type | [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend | long | time | cases | geography | case_type | ? | [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend | wide | time | mortality rate | geography | none | none | [15, 50, 4] geography
death vs total correlation | wide | cases | deaths | geography | none | none | [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are deaths/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical region, logcase vs day (100th case)
    # output: m, b for each geographical region ID



#total/death on the same plot- they differ by ~2 logs, so when plotting log scale, use pch; when plotting absolute, use free scales
#when plotting death and case on the same plot, melt.

#Corona_Cases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 48210
China 135
Diamond Princess 116
Korea, South 106
Japan 105
Italy 103
Iran 100
Singapore 97
France 96
Germany 96
Spain 95
US 94
Switzerland 92
United Kingdom 92
Belgium 91
Netherlands 91
Norway 91
Sweden 91
Austria 89
Malaysia 88
Australia 87
Bahrain 87
Denmark 87
Canada 86
Qatar 86
Iceland 85
Brazil 84
Czechia 84
Finland 84
Greece 84
Iraq 84
Israel 84
Portugal 84
Slovenia 84
Egypt 83
Estonia 83
India 83
Ireland 83
Kuwait 83
Philippines 83
Poland 83
Romania 83
Saudi Arabia 83
Indonesia 82
Lebanon 82
San Marino 82
Thailand 82
Chile 81
Pakistan 81
Luxembourg 80
Peru 80
Russia 80
Ecuador 79
Mexico 79
Slovakia 79
South Africa 79
United Arab Emirates 79
Armenia 78
Colombia 78
Croatia 78
Panama 78
Serbia 78
Taiwan* 78
Turkey 78
Argentina 77
Bulgaria 77
Latvia 77
Uruguay 77
Algeria 76
Costa Rica 76
Dominican Republic 76
Hungary 76
Andorra 75
Bosnia and Herzegovina 75
Jordan 75
Lithuania 75
Morocco 75
New Zealand 75
North Macedonia 75
Vietnam 75
Albania 74
Cyprus 74
Malta 74
Moldova 74
Brunei 73
Burkina Faso 73
Sri Lanka 73
Tunisia 73
Ukraine 72
Azerbaijan 71
Ghana 71
Kazakhstan 71
Oman 71
Senegal 71
Venezuela 71
Afghanistan 70
Cote d’Ivoire 70
Cuba 69
Mauritius 69
Uzbekistan 69
Cambodia 68
Cameroon 68
Honduras 68
Nigeria 68
West Bank and Gaza 68
Belarus 67
Georgia 67
Bolivia 66
Kosovo 66
Kyrgyzstan 66
Montenegro 66
Congo (Kinshasa) 65
Kenya 64
Niger 63
Guinea 62
Rwanda 62
Trinidad and Tobago 62
Paraguay 61
Bangladesh 60
Djibouti 58
El Salvador 57
Guatemala 56
Madagascar 55
Mali 54
Congo (Brazzaville) 51
Jamaica 51
Gabon 49
Somalia 49
Tanzania 49
Ethiopia 48
Burma 47
Sudan 46
Liberia 45
Maldives 43
Equatorial Guinea 42
Cabo Verde 40
Sierra Leone 38
Guinea-Bissau 37
Togo 37
Zambia 36
Eswatini 35
Chad 34
Tajikistan 33
Haiti 31
Sao Tome and Principe 31
Benin 29
Nepal 29
Uganda 29
Central African Republic 28
South Sudan 28
Guyana 26
Mozambique 25
Yemen 21
Mongolia 20
Mauritania 17
Nicaragua 17
Malawi 11
Syria 11
Zimbabwe 9
Bahamas 8
Libya 8
Comoros 6
# Pick the top countries with data (capped at 12 by the color palette)
max_colors<-12
# find a way to fix this: China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 135
Korea, South 106
Japan 105
Italy 103
Iran 100
Singapore 97
France 96
Germany 96
Spain 95
US 94
Switzerland 92
United Kingdom 92
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-03-20        18341
## 2    Afghanistan           <NA> <NA> 2020-03-19        18340
## 3    Afghanistan           <NA> <NA> 2020-05-27        18409
## 4    Afghanistan           <NA> <NA> 2020-05-28        18410
## 5    Afghanistan           <NA> <NA> 2020-03-21        18342
## 6    Afghanistan           <NA> <NA> 2020-02-05        18297
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                    24     0.00000000
## 2                      0                    22     0.00000000
## 3                    227                 12456     0.01822415
## 4                    235                 13036     0.01802700
## 5                      0                    24     0.00000000
## 6                      0                     0            NaN
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  1.380211                       -Inf        18348
## 2                  1.342423                       -Inf        18348
## 3                  4.095379                   2.356026        18348
## 4                  4.115144                   2.371068        18348
## 5                  1.380211                       -Inf        18348
## 6                      -Inf                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             -7  NA   NA         NA                           NA
## 2             -8  NA   NA         NA                           NA
## 3             61  NA   NA         NA                           NA
## 4             62  NA   NA         NA                           NA
## 5             -6  NA   NA         NA                           NA
## 6            -51  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)
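The shared-argument-list pattern above mutates one list and re-dispatches it via do.call; on a toy frame (toy names) it looks like:

```r
library(reshape2)
# One argument list reused across several melts; only $data changes per call
melt_args <- list(variable.name = "case_type", value.name = "cases",
                  measure.vars = c("total", "death"))
df <- data.frame(region = "A", total = 100, death = 5)
melt_args$data <- df
long <- do.call(melt, melt_args)
long$cases  # 100 5
```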


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)
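The commented-out fit above boils down to: the slope of log10 cases vs days gives the growth rate, from which a doubling time follows. A sketch on synthetic data that doubles exactly every 3 days:

```r
days  <- 0:20
cases <- 100 * 2^(days / 3)          # doubles every 3 days by construction
fit   <- lm(log10(cases) ~ days)
slope <- unname(coef(fit)["days"])   # log10(2)/3, approx 0.1003
log10(2) / slope                     # recovered doubling time: 3
```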

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12)))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs. date becomes linear in each plot
  2. Fit a linear model for each state
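The helper `find_linear_index` used below is defined upstream of this section. As a hedged sketch of the idea only (assumed behavior, hypothetical names), a version that scans for the first date where the discrete second difference of cases is near zero could look like:

```r
# Hypothetical sketch of find_linear_index: find the numeric Date at which
# per-state cases stop accelerating (discrete 2nd derivative near zero),
# i.e. where the case curve becomes roughly linear.
find_linear_index_sketch <- function(df, tol = 50) {
  df <- df[order(df$Date), ]
  cases <- df$Total_confirmed_cases_perstate
  accel <- diff(diff(cases))          # discrete second derivative
  idx <- which(abs(accel) <= tol)[1]  # first near-zero acceleration
  if (is.na(idx)) idx <- 1
  as.numeric(df$Date[idx + 1])        # accel[i] is centered on Date[i + 1]
}
```

The tolerance would need tuning per state; the real `find_linear_index` returns the value stored in column `V1` that the fitting loop below consumes.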

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the top states: swap New York out for Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Start with an empty frame so no NA placeholder row ends up in the output
predicted_days_df<-data.frame(Province.State=character(0),days=numeric(0),Total_confirmed_cases_perstate=numeric(0))

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")
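`prediction_model` is defined upstream; assuming it simply evaluates the fitted line at the requested numeric dates (an assumption, the actual implementation may differ), a sketch would be:

```r
# Hypothetical sketch of prediction_model: evaluate the fitted line
# y = m * day + b at the requested numeric days and return predictions
# shaped like the observed per-state totals.
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```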

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted trajectory of COVID-19 based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing (and its delays) noticeable?
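One quantity that falls straight out of a log10-linear fit helps with the "how long until x cases" question: if log10(cases) grows with slope m per day, cases double every log10(2)/m days:

```r
# Doubling time implied by a log10-linear fit with slope m (log10 units/day):
# cases double when log10(cases) increases by log10(2).
doubling_time <- function(m) log10(2) / m

doubling_time(0.10)  # a slope of 0.1 log10-units/day implies roughly 3-day doubling
```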

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # coef() names the columns "(Intercept)" and after the predictor (here "x");
  # rename them to b and m to match the per-state model table
  case_models.parameters<-plyr::rename(case_models.parameters,c("(Intercept)"="b","x"="m"))
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/
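The scraped mobility columns arrive as percent strings (e.g. "-45%"), which is why the code below strips "%" before converting. A tiny helper capturing that cleanup (assuming that string format):

```r
# Convert Google-mobility percent strings such as "-45%" to numeric values.
percent_to_numeric <- function(x) as.numeric(sub("%", "", x, fixed = TRUE))
```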

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9981406 -56.0
Hawaii Grocery_Pharmacy 0.9816785 -34.0
New Hampshire Parks 0.9551603 -20.0
Maine Transit -0.9144999 -50.0
Utah Residential -0.8932683 12.0
Connecticut Grocery_Pharmacy -0.8899222 -6.0
Vermont Parks 0.8877449 -35.5
South Dakota Parks 0.8400530 -26.0
Utah Transit -0.8134464 -18.0
Wyoming Parks -0.7897688 -4.0
Massachusetts Workplace -0.7550263 -39.0
Rhode Island Workplace -0.7539745 -39.5
Connecticut Transit -0.7500513 -50.0
Hawaii Residential -0.7498071 19.0
Alaska Workplace -0.7462002 -33.0
Alaska Grocery_Pharmacy -0.7387382 -7.0
Alaska Residential 0.7373556 13.0
Hawaii Parks 0.7210511 -72.0
Arizona Grocery_Pharmacy -0.7167368 -15.0
Wyoming Transit -0.7084095 -17.0
Hawaii Transit 0.6616564 -89.0
Vermont Grocery_Pharmacy -0.6578329 -25.0
New York Workplace -0.6498118 -34.5
Utah Parks -0.6496621 17.0
Rhode Island Retail_Recreation -0.6296900 -45.0
Maine Workplace -0.6292178 -30.0
Rhode Island Residential -0.6121146 18.5
Utah Workplace -0.6069536 -37.0
New Jersey Parks -0.6034489 -6.0
New York Retail_Recreation -0.5879606 -46.0
Nebraska Workplace 0.5845887 -32.0
North Dakota Parks 0.5766987 -34.0
Arizona Retail_Recreation -0.5558848 -42.5
New Jersey Workplace -0.5558231 -44.0
Arizona Residential 0.5370730 13.0
Nevada Transit -0.5300771 -20.0
New York Parks 0.5288292 20.0
Connecticut Residential 0.5179224 14.0
Connecticut Retail_Recreation -0.5133723 -45.0
Maine Parks 0.5034998 -31.0
Massachusetts Retail_Recreation -0.5029260 -44.0
North Dakota Retail_Recreation -0.4941084 -42.0
Hawaii Workplace 0.4919233 -46.0
New Jersey Grocery_Pharmacy -0.4825638 2.5
Kansas Parks 0.4806868 72.0
New Mexico Grocery_Pharmacy -0.4792608 -11.0
West Virginia Parks 0.4791711 -33.0
Connecticut Workplace -0.4790251 -39.0
Montana Parks -0.4784136 -58.0
Nebraska Residential -0.4721875 14.0
Rhode Island Parks 0.4677024 52.0
Montana Residential 0.4657252 14.0
Iowa Parks -0.4585695 28.5
New Jersey Retail_Recreation -0.4534924 -62.5
Idaho Residential -0.4512222 11.0
New Mexico Residential 0.4486365 13.5
Wyoming Workplace -0.4428999 -31.0
Illinois Transit -0.4374234 -31.0
New Mexico Parks 0.4323152 -31.5
Pennsylvania Workplace -0.4300693 -36.0
Maryland Workplace -0.4236538 -35.0
South Carolina Workplace 0.4224839 -30.0
Arizona Transit 0.4215412 -38.0
Vermont Residential 0.4175568 11.5
New Jersey Transit -0.4117657 -50.5
Massachusetts Grocery_Pharmacy -0.4116300 -7.0
New Hampshire Residential -0.4059542 14.0
Alabama Grocery_Pharmacy -0.4046073 -2.0
Kentucky Parks -0.3936095 28.5
Maryland Grocery_Pharmacy -0.3866665 -10.0
Pennsylvania Retail_Recreation -0.3784634 -45.0
Alabama Workplace -0.3756915 -29.0
Idaho Workplace -0.3739069 -29.0
Idaho Grocery_Pharmacy -0.3725278 -5.5
New York Transit -0.3712434 -48.0
California Transit -0.3691606 -42.0
Wisconsin Transit -0.3646328 -23.5
Michigan Parks 0.3554515 30.0
California Residential 0.3541438 14.0
New Mexico Retail_Recreation -0.3525881 -42.5
Idaho Transit -0.3480983 -30.0
Nebraska Grocery_Pharmacy 0.3408005 -0.5
Wyoming Grocery_Pharmacy -0.3339218 -10.0
Alaska Retail_Recreation 0.3243757 -39.0
North Carolina Grocery_Pharmacy 0.3207298 0.0
California Parks -0.3132199 -38.5
Maine Retail_Recreation -0.3119032 -42.0
Virginia Transit -0.3106246 -33.0
Minnesota Transit -0.3048052 -28.5
Maryland Retail_Recreation -0.3005922 -39.0
Nevada Residential 0.2951500 17.0
Vermont Retail_Recreation 0.2923348 -57.0
Pennsylvania Parks 0.2902549 12.0
Alabama Transit -0.2898198 -36.5
Arkansas Retail_Recreation -0.2897917 -30.0
Colorado Residential 0.2892009 14.0
North Carolina Workplace 0.2854308 -31.0
Mississippi Residential 0.2847984 13.0
Florida Residential 0.2843767 14.0
North Dakota Workplace 0.2840922 -40.0
Illinois Workplace -0.2836768 -30.5
Oregon Grocery_Pharmacy 0.2748764 -7.0
Rhode Island Transit -0.2692428 -56.0
Texas Residential -0.2690715 15.0
Texas Workplace 0.2663692 -32.0
Georgia Grocery_Pharmacy -0.2638408 -10.0
Missouri Residential -0.2584970 13.0
Rhode Island Grocery_Pharmacy 0.2569036 -7.5
Vermont Workplace -0.2548342 -43.0
Kansas Workplace 0.2543891 -32.0
Maryland Residential 0.2543145 15.0
West Virginia Grocery_Pharmacy -0.2518129 -6.0
Tennessee Workplace -0.2492381 -31.0
Montana Workplace -0.2480205 -40.0
Illinois Parks 0.2466557 26.5
Florida Parks -0.2463378 -43.0
Montana Transit 0.2444841 -41.0
Pennsylvania Grocery_Pharmacy -0.2433644 -6.0
Arkansas Residential 0.2395518 12.0
Idaho Retail_Recreation -0.2380622 -39.5
South Carolina Parks -0.2364358 -23.0
Tennessee Residential 0.2329346 11.5
New York Grocery_Pharmacy -0.2308351 8.0
Nevada Retail_Recreation -0.2268358 -43.0
Arkansas Parks 0.2258157 -12.0
Georgia Workplace -0.2172532 -33.5
Utah Retail_Recreation -0.2151101 -40.0
Wisconsin Parks 0.2119734 51.5
Illinois Residential 0.2043008 14.0
Georgia Retail_Recreation -0.2017063 -41.0
Michigan Workplace -0.2002614 -40.0
North Carolina Transit 0.1993158 -32.0
Mississippi Grocery_Pharmacy -0.1988625 -8.0
Washington Grocery_Pharmacy 0.1958700 -7.0
North Dakota Grocery_Pharmacy -0.1951219 -8.0
West Virginia Workplace 0.1942676 -33.0
Kansas Grocery_Pharmacy -0.1935502 -14.0
Colorado Parks -0.1879658 2.0
Minnesota Parks 0.1872074 -9.0
Texas Parks 0.1854674 -42.0
Kentucky Transit -0.1819855 -31.0
Virginia Residential 0.1813304 14.0
California Grocery_Pharmacy -0.1804892 -11.5
North Carolina Residential 0.1788085 13.0
Wisconsin Residential -0.1785030 14.0
South Dakota Workplace 0.1775769 -35.0
Connecticut Parks 0.1762029 43.0
New Jersey Residential 0.1751858 18.0
Kentucky Grocery_Pharmacy 0.1683251 4.0
California Retail_Recreation -0.1675263 -44.0
Iowa Transit 0.1664721 -24.0
New Mexico Transit 0.1644552 -38.5
Pennsylvania Transit -0.1630487 -42.0
Massachusetts Parks 0.1612698 39.0
Washington Workplace -0.1606326 -38.0
New Hampshire Retail_Recreation -0.1604343 -41.0
Missouri Workplace 0.1603830 -28.0
Ohio Transit 0.1575905 -28.0
Indiana Residential 0.1559371 12.0
Indiana Retail_Recreation 0.1520955 -38.0
Florida Retail_Recreation 0.1472681 -43.0
Georgia Residential -0.1450504 13.0
Nebraska Transit -0.1449693 -9.0
Oregon Workplace -0.1444180 -31.0
Virginia Grocery_Pharmacy -0.1440830 -8.0
South Dakota Transit -0.1417546 -40.0
Oregon Transit 0.1413508 -27.5
South Dakota Residential 0.1379513 15.0
Michigan Retail_Recreation -0.1366892 -53.0
Mississippi Retail_Recreation -0.1366561 -40.0
Alabama Parks 0.1345378 -1.0
Mississippi Transit -0.1342114 -38.5
North Dakota Transit 0.1314598 -48.0
Wisconsin Workplace -0.1263219 -31.0
Washington Residential 0.1256584 13.0
Nevada Workplace -0.1236802 -40.0
Virginia Parks 0.1218517 6.0
Alabama Retail_Recreation 0.1209011 -39.0
Minnesota Retail_Recreation 0.1201792 -40.0
Arizona Parks -0.1183385 -44.5
Indiana Parks -0.1178279 29.0
California Workplace -0.1148741 -36.0
Oregon Retail_Recreation 0.1129807 -40.5
Massachusetts Transit -0.1125224 -45.0
New Hampshire Grocery_Pharmacy -0.1119111 -6.0
Ohio Parks -0.1109293 67.5
Wisconsin Grocery_Pharmacy 0.1101294 -1.0
Wyoming Residential 0.1090806 12.5
Nebraska Retail_Recreation 0.1065169 -36.0
Texas Grocery_Pharmacy 0.1059299 -14.0
Kansas Transit -0.1051291 -26.5
Maryland Transit -0.1038073 -39.0
Texas Transit 0.1030240 -41.0
Kentucky Residential 0.1028650 12.0
Massachusetts Residential 0.1017334 15.0
North Carolina Parks -0.1008428 7.0
Arkansas Workplace -0.1001121 -26.0
Idaho Parks 0.0997497 -22.0
Ohio Residential 0.0980483 14.0
North Dakota Residential -0.0979270 17.0
Oklahoma Grocery_Pharmacy -0.0973996 -0.5
Minnesota Workplace -0.0961775 -33.0
Oregon Residential 0.0946687 10.5
Tennessee Parks -0.0944820 10.5
New York Residential 0.0925416 17.5
South Carolina Residential -0.0923748 12.0
Oklahoma Residential 0.0914107 15.0
Nevada Parks 0.0867346 -12.5
Georgia Parks 0.0860662 -6.0
Indiana Workplace 0.0857592 -34.0
Virginia Workplace -0.0846464 -31.5
Pennsylvania Residential 0.0845514 15.0
Missouri Transit -0.0817257 -24.5
Maine Residential -0.0812125 11.0
Oklahoma Workplace 0.0803334 -31.0
Wyoming Retail_Recreation -0.0787024 -39.0
Michigan Residential 0.0776779 15.0
Arkansas Transit 0.0774039 -27.0
North Carolina Retail_Recreation 0.0772187 -34.0
Virginia Retail_Recreation -0.0760514 -35.0
Ohio Grocery_Pharmacy 0.0753402 0.0
Washington Transit -0.0739423 -33.5
Colorado Transit 0.0730944 -36.0
South Carolina Transit 0.0724508 -45.0
Michigan Transit 0.0716594 -46.0
Minnesota Grocery_Pharmacy 0.0699010 -6.0
Iowa Workplace 0.0687319 -30.0
Arizona Workplace -0.0681720 -35.0
Minnesota Residential -0.0678343 17.0
Kentucky Retail_Recreation 0.0676798 -29.0
Michigan Grocery_Pharmacy -0.0674201 -11.0
Mississippi Parks -0.0671417 -25.0
Indiana Grocery_Pharmacy -0.0664965 -5.5
Florida Grocery_Pharmacy 0.0654229 -14.0
Mississippi Workplace -0.0644415 -33.0
Utah Grocery_Pharmacy 0.0632600 -4.0
Florida Workplace -0.0627411 -33.0
Washington Parks 0.0618249 -3.5
West Virginia Transit -0.0616678 -45.0
Iowa Grocery_Pharmacy -0.0587747 4.0
Texas Retail_Recreation 0.0582585 -40.0
Ohio Retail_Recreation 0.0557450 -36.0
Missouri Parks 0.0539088 0.0
Maine Grocery_Pharmacy -0.0491931 -13.0
Alabama Residential -0.0491413 11.0
West Virginia Residential -0.0487944 11.0
South Carolina Grocery_Pharmacy 0.0477584 1.0
Kentucky Workplace -0.0469354 -36.5
Illinois Grocery_Pharmacy -0.0465534 2.0
New Hampshire Workplace 0.0420473 -37.0
Indiana Transit 0.0410392 -29.0
South Dakota Retail_Recreation -0.0373756 -39.0
Montana Grocery_Pharmacy -0.0369826 -16.0
Wisconsin Retail_Recreation 0.0355492 -44.0
South Carolina Retail_Recreation -0.0353652 -35.0
Ohio Workplace -0.0340792 -35.0
Colorado Grocery_Pharmacy -0.0302752 -17.0
Oregon Parks 0.0282445 16.5
Vermont Transit -0.0278734 -63.0
Illinois Retail_Recreation 0.0259000 -40.0
Tennessee Grocery_Pharmacy 0.0239617 6.0
Colorado Retail_Recreation -0.0229592 -44.0
Missouri Retail_Recreation -0.0224675 -36.5
Washington Retail_Recreation -0.0198827 -42.0
Montana Retail_Recreation -0.0191786 -51.0
New Hampshire Transit -0.0188302 -57.0
New Mexico Workplace 0.0178465 -34.0
Florida Transit -0.0165560 -49.0
Kansas Residential -0.0164336 13.0
Oklahoma Retail_Recreation 0.0154539 -31.0
Oklahoma Parks 0.0152846 -18.5
Georgia Transit -0.0132219 -35.0
Iowa Residential -0.0129721 13.0
Colorado Workplace 0.0122325 -39.0
Oklahoma Transit 0.0115364 -26.0
Missouri Grocery_Pharmacy 0.0112577 2.0
Arkansas Grocery_Pharmacy -0.0109455 3.0
Nevada Grocery_Pharmacy 0.0107420 -12.5
West Virginia Retail_Recreation -0.0102852 -38.5
Kansas Retail_Recreation -0.0102258 -38.0
Tennessee Transit 0.0100314 -32.0
South Dakota Grocery_Pharmacy 0.0089837 -9.0
Tennessee Retail_Recreation -0.0082940 -30.0
Nebraska Parks 0.0071398 55.5
Maryland Parks 0.0038485 27.0
Iowa Retail_Recreation 0.0027313 -38.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, like MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).
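As a quick sanity check on the mortality figure, the rate can be recomputed directly as deaths divided by confirmed cases. A minimal sketch with round illustrative numbers and hypothetical column names (not values from this analysis):

```r
# Sketch only: mortality rate = total deaths / total confirmed cases.
# The counts below are illustrative placeholders, not source data.
cases <- data.frame(
  region          = c("World", "US"),
  total_confirmed = c(6500000, 1900000),
  total_deaths    = c(385000, 110000)
)
cases$mortality_rate <- cases$total_deaths / cases$total_confirmed
cases  # both illustrative rates land above the 1% floor noted above
```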

See table below for detailed breakdown.

Question | Answer
What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California shows one of the better correlations, but results are mixed.
What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with growth beginning to trail off for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall.
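The "log-linear phase" claim above can be checked by regressing log-transformed case counts on time: a stable positive slope means exponential growth, while a slope shrinking toward zero on recent data means growth has trailed off. A minimal sketch on synthetic data (all names hypothetical; this is not the model used for the plots above):

```r
# Synthetic exponential growth at 15% per day, so log10(cases) is linear in time.
days  <- 1:30
cases <- 100 * 1.15^days
fit   <- lm(log10(cases) ~ days)
coef(fit)[["days"]]  # slope recovers log10(1.15), about 0.061
```

Fitting the same model on a trailing window (e.g. the last 14 days) and watching the slope decline is one simple way to date when a region leaves the log-linear phase.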

END

End: ##—— Fri Jun 5 17:47:09 2020 ——##


Sandbox

# Geographical heatmap!
install.packages("maps")
library(maps)
library(dplyr)    # for %>% and select()
library(ggplot2)
# County polygons for Pennsylvania (variable name left over from an earlier Michigan draft)
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # placeholder case counts, one per polygon vertex
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

# Variable names kept from the tutorial this was adapted from; only the U.S. is mapped here
some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinates for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# There is no `join` package; dplyr's inner_join() dispatches to sf's method for sf objects
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf (st_cast() expects an sf/sfc object)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net